The World Happiness Report is a landmark survey of the state of global happiness. The happiness scores and rankings use data from the Gallup World Poll (GWP). The scores are based on answers to the main life evaluation question asked in the poll. This question, known as the Cantril ladder, asks respondents to think of a ladder on which the best possible life for them is a 10 and the worst possible life is a 0, and to rate their own current lives on that scale.
In addition, the report provides six factors (levels of GDP, life expectancy, generosity, social support, freedom, and corruption). For each country, the value of a factor shows the estimated extent to which that factor makes life evaluations (the happiness score) higher than in Dystopia. The underlying raw data points for these estimates are provided by other organisations (e.g. WHO) or come from Gallup World Poll question results. Dystopia, in this context, is a hypothetical country with values equal to the world’s lowest national averages for each of the six factors' raw values. The purpose of establishing Dystopia is to have a benchmark against which all countries can be favorably compared (no country performs more poorly than Dystopia) in terms of each of the six key variables. Since life would be very unpleasant in a country with the world’s lowest incomes, lowest life expectancy, lowest generosity, most corruption, least freedom, and least social support, it is referred to as “Dystopia,” in contrast to Utopia.
Thus, the value of each of the six factors expresses how much that factor contributes to a country's happiness score being higher than in Dystopia. The happiness score can therefore be calculated as: \[\text{happiness score} = \sum_{i=1}^{6} \text{factor value}_i + \text{Dystopia happiness} + \text{residual}\]
This makes it clear that the six factor values are themselves the result of an estimation and therefore cannot be used to analyse variable importance. Regression coefficients fitted on them, for example, would not be helpful at all: with the residual included in the dataset, the intercept would be 0 and all coefficients would be 1.
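As a quick check of this decomposition, the sketch below sums the six “explained by” values and the combined Dystopia-plus-residual column for the three 2015 rows shown in the first table of this report. The data frame name `explained_by` and its shortened column names are made up for the illustration; the numbers are copied from that table.

```r
# Values copied from the 2015 rows of the first table in this report.
# `explained_by` and its column names are hypothetical, for illustration only.
explained_by <- data.frame(
  Country          = c("Switzerland", "Iceland", "Denmark"),
  Score            = c(7.587, 7.561, 7.527),
  Economy          = c(1.39651, 1.30232, 1.32548),
  Family           = c(1.34951, 1.40223, 1.36058),
  Health           = c(0.94143, 0.94784, 0.87464),
  Freedom          = c(0.66557, 0.62877, 0.64938),
  Trust            = c(0.41978, 0.14145, 0.48357),
  Generosity       = c(0.29678, 0.43630, 0.34139),
  DystopiaResidual = c(2.51738, 2.70201, 2.49204)
)

# Sum the six factor values plus the Dystopia-and-residual column per country.
parts <- c("Economy", "Family", "Health", "Freedom", "Trust",
           "Generosity", "DystopiaResidual")
explained_by$Reconstructed <- rowSums(explained_by[, parts])

# The reconstructed score agrees with the published score up to rounding.
print(explained_by[, c("Country", "Score", "Reconstructed")])
```

Because the sum reproduces the score almost exactly, a regression of the score on these columns has nothing left to estimate, which is the reason given above for switching to the raw values.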
That is why we looked for an additional version of the happiness dataset that includes the actual raw values, which we can use both for analysing variable importance and for the dimension reduction steps.
Based on the happiness dataset, we want to answer the following leading questions.
Can happiness be explained by certain factors? What are those factors and how strongly do they influence happiness? For these questions we need the raw values to build our analysis on. We also decided to add further factors which might explain the different happiness levels. We were interested in how drug abuse correlates with happiness and found suitable datasets for alcohol consumption and tobacco consumption. Additionally, we were interested in how the modern use of social media influences happiness. However, we only found a fitting internet dataset, which captures the percentage of individuals in a country who use the Internet.
For the change in happiness over time we can use the plain happiness dataset, as it captures the happiness scores and the “explained by” parts of the six factors over time. We can therefore calculate and visualize the changes over time.
For these reasons, we decided to create two datasets to answer our two main questions.
| Country | Region | Happiness.Rank | Happiness.Score | Standard.Error | Economy..GDP.per.Capita. | Family | Health..Life.Expectancy. | Freedom | Trust..Government.Corruption. | Generosity | Dystopia.Residual |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | Western Europe | 1 | 7.587 | 0.03411 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2.51738 |
| Iceland | Western Europe | 2 | 7.561 | 0.04884 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2.70201 |
| Denmark | Western Europe | 3 | 7.527 | 0.03328 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2.49204 |
| Country | Happiness.Rank | Happiness | Economy | Family | Health | Freedom | Trust | Generosity | Year | Region |
|---|---|---|---|---|---|---|---|---|---|---|
| Switzerland | 1 | 7.587 | 1.39651 | 1.34951 | 0.94143 | 0.66557 | 0.41978 | 0.29678 | 2015 | Western Europe |
| Iceland | 2 | 7.561 | 1.30232 | 1.40223 | 0.94784 | 0.62877 | 0.14145 | 0.43630 | 2015 | Western Europe |
| Denmark | 3 | 7.527 | 1.32548 | 1.36058 | 0.87464 | 0.64938 | 0.48357 | 0.34139 | 2015 | Western Europe |
To answer the question “What influences happiness?” we had to use the raw data of the factors and not their “explained by” values. In addition, we wanted to include further factors and added three datasets: alcohol consumption, tobacco consumption and internet usage.
Merging these datasets gives us four additional factors (Alcohol, Population, Tobacco and Internet).
To join the different datasets we had to do some preprocessing, which can be seen in the preprocessing step. The main steps were cleaning the data (region, country code, NaN) and joining the datasets on the year and the country code.
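The join on year and country code could look roughly like the following base R sketch. The two small data frames and their column names (`Code`, `Year`) are stand-ins invented for the example, not the actual preprocessing code of this report.

```r
# Toy stand-ins for the real tables; names and values are illustrative only.
happiness <- data.frame(
  Code      = c("ALB", "ARG", "ARM", "ALB"),
  Year      = c(2018, 2018, 2018, 2017),
  Happiness = c(5.0, 5.8, 5.1, 4.6)
)
alcohol <- data.frame(
  Code    = c("ALB", "ARG", "ARM"),
  Year    = c(2018, 2018, 2018),
  Alcohol = c(7.17, 9.65, 5.55)
)

# Left join on country code and year: happiness rows without a match get NA.
merged <- merge(happiness, alcohol, by = c("Code", "Year"), all.x = TRUE)
print(merged)
```

Using `all.x = TRUE` keeps every happiness row, so gaps in the additional datasets show up as missing values instead of silently dropping countries.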
After joining, we noticed that the three additional datasets do not contain data for the whole timespan 2015-2022 (fig. “missing values full data”). Therefore, we decided to use only one year for analysing the influential factors.
Figure: missing values full data
We inspected the missing values of each year and chose the year with the fewest missing values, 2018 (fig. “missing values 2018”). We then excluded all rows that still contained missing values. Figure “missing values 2017” shows, for example, that the smoking and alcohol datasets did not contain any values for 2017. We also renamed the columns to get shorter labels.
The final influential factors dataset consists of 96 rows (countries for the year 2018) and 18 columns. A more detailed explanation of the variables can be found in the Statistical Appendix of the World Happiness Report.
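A minimal sketch of this selection step, assuming a joined data frame with a `Year` column and one column per factor. The data below are made up, and `full_data`, `na_per_year` and `one_year` are hypothetical names.

```r
# Toy data; the column names mirror the report, the values are made up.
full_data <- data.frame(
  Year    = c(2017, 2017, 2017, 2018, 2018, 2018),
  Alcohol = c(NA, NA, NA, 7.2, 9.7, NA),
  Tobacco = c(NA, NA, NA, 29.2, 21.8, 26.7)
)

# Count the missing cells per year across all factor columns.
factor_cols <- setdiff(names(full_data), "Year")
na_per_year <- sapply(split(full_data[factor_cols], full_data$Year),
                      function(d) sum(is.na(d)))
print(na_per_year)

# Keep the year with the fewest missing values, then drop incomplete rows.
best_year <- as.numeric(names(which.min(na_per_year)))
one_year  <- na.omit(full_data[full_data$Year == best_year, ])
print(one_year)
```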
| Country | Region | Year | Happiness | Economy | Social | Health | Freedom | Generosity | Corruption | Positive | Negative | Government | Code | Alcohol | Population | Tobacco | Internet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Albania | Central and Eastern Europe | 2018 | 5.004403 | 9.412399 | 0.6835917 | 68.7 | 0.8242123 | 0.0053850 | 0.8991294 | 0.7132996 | 0.3189967 | 0.4353380 | ALB | 7.17 | 2882735 | 29.2 | 65.40000 |
| Argentina | Latin America and Caribbean | 2018 | 5.792797 | 9.809972 | 0.8999116 | 68.8 | 0.8458947 | -0.2069366 | 0.8552552 | 0.8203097 | 0.3205021 | 0.2613523 | ARG | 9.65 | 44361150 | 21.8 | 77.70000 |
| Armenia | Commonwealth of Independent States | 2018 | 5.062449 | 9.119424 | 0.8144490 | 66.9 | 0.8076437 | -0.1491087 | 0.6768264 | 0.5814877 | 0.4548403 | 0.6708276 | ARM | 5.55 | 2951741 | 26.7 | 68.24505 |
Figure: missing values 2017
Figure: missing values 2018
One of the objectives of preliminary data analysis is to get a feel for the data by describing its key features and summarizing the results. We focus on the second dataset, the influential factors dataset, which includes the raw values rather than the “explained by” values.
First, we use the summary to check how the explanatory variables are distributed. As we can see, they are on different scales, especially “Health”, “Population” and “Internet”. Because we do not want the dimension reduction analysis to be driven by the variables with the largest distances, we scale each variable by \(\frac{x - \text{mean}(x)}{\text{sd}(x)}\).
## Happiness Economy Social Health Freedom Corruption Generosity Positive Negative Government Alcohol
## Min. :3.335 Min. : 6.630 Min. :0.5035 Min. :48.20 Min. :0.5286 Min. :0.1506 Min. :-0.33638 Min. :0.4347 Min. :0.1580 Min. :0.07971 Min. : 0.019
## 1st Qu.:4.702 1st Qu.: 8.570 1st Qu.:0.7396 1st Qu.:59.85 1st Qu.:0.7245 1st Qu.:0.6849 1st Qu.:-0.14312 1st Qu.:0.6427 1st Qu.:0.2132 1st Qu.:0.33120 1st Qu.: 4.280
## Median :5.536 Median : 9.669 Median :0.8581 Median :66.80 Median :0.8084 Median :0.7989 Median :-0.02550 Median :0.7353 Median :0.2749 Median :0.50385 Median : 7.410
## Mean :5.597 Mean : 9.394 Mean :0.8220 Mean :65.23 Mean :0.7945 Mean :0.7255 Mean :-0.01767 Mean :0.7114 Mean :0.2845 Mean :0.50944 Mean : 7.221
## 3rd Qu.:6.340 3rd Qu.:10.346 3rd Qu.:0.9130 3rd Qu.:71.20 3rd Qu.:0.8784 3rd Qu.:0.8559 3rd Qu.: 0.07377 3rd Qu.:0.8000 3rd Qu.:0.3509 3rd Qu.:0.64084 3rd Qu.:10.570
## Max. :7.858 Max. :11.454 Max. :0.9660 Max. :75.00 Max. :0.9699 Max. :0.9520 Max. : 0.49938 Max. :0.8836 Max. :0.5438 Max. :0.98812 Max. :15.090
## Population Tobacco Internet
## Min. :6.042e+05 Min. : 4.60 Min. : 8.00
## 1st Qu.:6.028e+06 1st Qu.:13.90 1st Qu.:30.80
## Median :1.585e+07 Median :22.80 Median :68.25
## Mean :5.380e+07 Mean :22.21 Mean :59.34
## 3rd Qu.:5.042e+07 3rd Qu.:27.95 3rd Qu.:81.62
## Max. :1.353e+09 Max. :45.50 Max. :97.32
We can see that every factor is now on the same scale. We have some outliers for Corruption, Generosity and Population.
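The standardisation described above can be sketched with base R's `scale()`. The three columns below are toy values on deliberately different scales, not the report's data.

```r
# Three toy columns on very different scales (values are made up).
factors <- data.frame(
  Health     = c(48.2, 59.9, 66.8, 71.2, 75.0),
  Population = c(6.0e5, 6.0e6, 1.6e7, 5.0e7, 1.4e9),
  Internet   = c(8.0, 30.8, 68.3, 81.6, 97.3)
)

# scale() centres each column on its mean and divides by its standard deviation.
scaled <- as.data.frame(scale(factors))

# The same transformation written out by hand for one column.
health_manual <- (factors$Health - mean(factors$Health)) / sd(factors$Health)

print(round(colMeans(scaled), 10))  # approximately 0 for every column
print(sapply(scaled, sd))           # 1 for every column
```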
In the correlation matrix plot we see that happiness has the strongest correlation with Economy (0.801), Internet (0.786), Social (0.768) and Health (0.767). For the correlations between the explanatory variables the following stand out:
In this chapter we try to answer the question “What influences happiness?” using several methods of influential factor analysis.
One tool for getting a first impression of what influences happiness is linear regression. For the regression we use the unscaled data. If our linear model predicts well, we can interpret the coefficients in terms of how they influence the outcome. This is also called regression analysis, where the goal is to isolate the relationship between each explanatory variable and the outcome variable.
However, this interpretation assumes that the value of one explanatory variable can be changed while the others are held fixed. This is only realistic if the explanatory variables are not correlated with each other. If this independence does not hold, we have a problem of multicollinearity. The coefficients can then swing wildly depending on which other independent variables are in the model. They become very sensitive to small changes in the model and cannot be easily interpreted.
One way to assess how strongly the explanatory variables are affected by multicollinearity is the variance inflation factor (VIF). VIFs identify correlations between the explanatory variables and their strength. VIFs between 1 and 5 suggest a moderate correlation; VIFs greater than 5 represent critical levels of multicollinearity where the coefficients are poorly estimated.
If we build a linear regression model on all explanatory variables, we get an R-squared of 0.8063. However, plotting the VIF values shows that this model has severe multicollinearity. Therefore, we cannot interpret the coefficients for Internet, Health and Economy.
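To make the VIF concrete: for predictor \(j\) it is \(1 / (1 - R_j^2)\), where \(R_j^2\) is the R-squared obtained by regressing predictor \(j\) on all other predictors. The sketch below computes this by hand on simulated data. The variables `x1`, `x2`, `x3` and the helper name `vif_manual` are assumptions for the illustration, and the report itself may have used a packaged function such as `car::vif()` instead.

```r
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # almost a copy of x1: strongly collinear
x3 <- rnorm(n)                  # unrelated to x1 and x2
predictors <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all remaining predictors.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(fit)$r.squared)
})
print(round(vif_manual, 2))
```

In this toy setup `x1` and `x2` end up far above the threshold of 5, while `x3` stays close to 1.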
##
## Call:
## lm(formula = Happiness ~ ., data = not_scaled_data_factors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.60190 -0.24719 0.00124 0.28565 1.79684
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.527e+00 1.487e+00 -1.027 0.3076
## Economy 3.317e-01 1.628e-01 2.037 0.0449 *
## Social 3.251e+00 9.952e-01 3.266 0.0016 **
## Health 7.641e-03 1.971e-02 0.388 0.6993
## Freedom 1.404e+00 8.833e-01 1.589 0.1159
## Corruption -1.247e+00 4.577e-01 -2.724 0.0079 **
## Generosity 7.633e-01 4.282e-01 1.783 0.0784 .
## Positive 6.045e-01 7.901e-01 0.765 0.4465
## Negative 2.332e+00 9.192e-01 2.537 0.0131 *
## Government -9.855e-01 4.520e-01 -2.180 0.0321 *
## Alcohol -3.825e-03 1.898e-02 -0.202 0.8407
## Population -3.861e-10 4.241e-10 -0.910 0.3654
## Tobacco -1.165e-02 7.295e-03 -1.596 0.1143
## Internet 6.001e-03 7.110e-03 0.844 0.4012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.541 on 81 degrees of freedom
## Multiple R-squared: 0.8063, Adjusted R-squared: 0.7752
## F-statistic: 25.94 on 13 and 81 DF, p-value: < 2.2e-16
If we build a linear regression model without Internet and Economy, we get an R-squared of 0.7745. This is lower than before, but the VIF plot shows that we can now interpret the coefficients of the remaining explanatory variables, as all VIF values are below 5.
Interestingly, only Social, Health, Corruption, Negative and Government are statistically significant:
##
## Call:
## lm(formula = Happiness ~ . - Internet - Economy, data = not_scaled_data_factors)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.63299 -0.30363 -0.02198 0.34810 2.08143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.609e+00 1.357e+00 -1.186 0.238858
## Social 4.646e+00 9.788e-01 4.746 8.54e-06 ***
## Health 5.214e-02 1.626e-02 3.207 0.001908 **
## Freedom 8.769e-01 9.265e-01 0.946 0.346660
## Corruption -1.616e+00 4.667e-01 -3.463 0.000847 ***
## Generosity 4.041e-01 4.430e-01 0.912 0.364406
## Positive 8.449e-01 8.287e-01 1.020 0.310893
## Negative 1.927e+00 9.682e-01 1.990 0.049879 *
## Government -1.016e+00 4.768e-01 -2.132 0.035974 *
## Alcohol 4.612e-03 2.003e-02 0.230 0.818514
## Population -1.135e-10 4.242e-10 -0.268 0.789649
## Tobacco -8.494e-03 7.700e-03 -1.103 0.273137
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5766 on 83 degrees of freedom
## Multiple R-squared: 0.7745, Adjusted R-squared: 0.7446
## F-statistic: 25.92 on 11 and 83 DF, p-value: < 2.2e-16
Next, we tried a linear regression method with shrinkage. With lasso regression, some estimates can become exactly zero. The result is therefore a form of variable selection, which makes the model sparse and easier to interpret. For lasso regression, all predictor variables should be scaled to the same standard deviation; otherwise, they are weighted unequally in the penalty term. The glmnet() function, however, standardizes the predictors by default and returns the coefficients on the original scale.
## [1] "Lasso Regression"
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 0.13892382
## Social 3.29213460
## Health 0.04780937
## Freedom .
## Corruption -0.62417854
## Generosity .
## Positive 0.12140602
## Negative .
## Government .
## Alcohol .
## Population .
## Tobacco .
The results of the lasso regression confirm the results of the ordinary regression for Social, Health and Corruption. However, Positive is added, and Negative and Government are removed from the model.
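To illustrate why the lasso sets some coefficients to exactly zero, the following sketch implements a bare-bones coordinate descent with soft-thresholding in base R on simulated data. It is not the glmnet() routine used for the results above. The function names `soft_threshold` and `lasso_cd`, the penalty value and the data are assumptions made for the illustration, and the coefficients it returns are on the standardised scale.

```r
# Soft-thresholding: shrink z towards zero by lambda, and clip at zero.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Minimal lasso via cyclic coordinate descent (illustrative, not glmnet).
# Minimises (1 / (2n)) * sum((y - X b)^2) + lambda * sum(abs(b))
# on standardised predictors and a centred response.
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  X <- scale(X)
  y <- y - mean(y)
  beta <- rep(0, ncol(X))
  for (it in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      # Residual with predictor j left out of the current fit.
      r_j <- y - as.vector(X[, -j, drop = FALSE] %*% beta[-j])
      z   <- mean(X[, j] * r_j)
      beta[j] <- soft_threshold(z, lambda) / mean(X[, j]^2)
    }
  }
  setNames(beta, colnames(X))
}

# Simulated data: only predictors a and b actually influence y.
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 4), n, 4, dimnames = list(NULL, c("a", "b", "c", "d")))
y <- 2 * X[, "a"] - 1 * X[, "b"] + rnorm(n, sd = 0.5)

beta <- lasso_cd(X, y, lambda = 0.3)
print(round(beta, 3))
```

The irrelevant predictors `c` and `d` are set exactly to zero, while `a` and `b` are kept but shrunk towards zero, which mirrors the variable selection behaviour described above.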
The change in the significant factors between the linear regression and lasso regression can be seen below:
Principal component analysis (PCA) is based on the assumption that strongly correlated variables share an underlying third variable that cannot be measured directly, stands behind the correlated variables and, in effect, expresses itself through them. In other words, the measurable variables are just another manifestation of background variables that cannot be measured directly. These background variables are called principal components, latent variables or factors. The goal of PCA is to derive such background variables from the measured data and to explain the observed relationships as completely as possible. PCA can therefore condense complex information into just a few orthogonal pieces of information.
PCA determines the factors on purely mathematical grounds. Since the first factor always points in the direction of maximum variance in the data, it represents the actually measured information best.
Using a sample of six hundred participants, a linear regression model was fitted and collinearity between the predictors was detected using the variance inflation factor (VIF). After confirming strong relationships between the independent variables, principal components were used to find linear combinations of the variables that capture a large share of the variance without much loss of information. The set of correlated variables was thus reduced to a smaller number of new variables that are independent of each other but consist of linear combinations of the related variables. To check for remaining relationships between the predictors, the dependent variable was regressed on these five principal components. The results show that the VIF values for each predictor ranged from 1 to 3, which indicates that the multicollinearity problem was eliminated.
For the PCA we use the scaled factors without the happiness score. Together, the first two PCs explain 59.01 % of the variation.
PC1 explains 39.07 % of the variation and the coefficients are the following:
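\[\begin{aligned} PC1 = {} & -0.415 \cdot \text{Economy} - 0.397 \cdot \text{Social} - 0.395 \cdot \text{Health} - 0.174 \cdot \text{Freedom} + 0.192 \cdot \text{Corruption} \\ & + 0.115 \cdot \text{Generosity} - 0.182 \cdot \text{Positive} + 0.317 \cdot \text{Negative} + 0.132 \cdot \text{Government} - 0.289 \cdot \text{Alcohol} \\ & + 0.069 \cdot \text{Population} - 0.164 \cdot \text{Tobacco} - 0.411 \cdot \text{Internet} \end{aligned}\]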
The first PCA plot, colored by the rounded happiness scores, clusters the countries quite well. For low values on PC1 and PC2 we find the highest happiness scores. The top 3 countries for 2018 (Finland, Denmark and Switzerland) are all in that region. It is also interesting that most of the countries in the lower left are from ‘Western Europe’, except for ‘New Zealand’, ‘Australia’ and ‘Canada’, which are from ‘North America and ANZ’. When we move from left to right, the happiness scores decrease. The values 8, 7, 6, 5 and 4 are quite well separated. An exception is the happiness category 3, whose countries are spread out over the right half of the plot.
An interesting outlier is Benin (BEN) on the middle right. Benin belongs to the happiness category 6 but lies on the far right. Another outlier is Botswana (BWA), which belongs to the happiness category 3 but lies in the very middle.
With the coefficients and the
coefficients
PC2 explains 19.94% of the variation and the coefficients are the following:
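\[\begin{aligned} PC2 = {} & 0.059 \cdot \text{Economy} + 0.014 \cdot \text{Social} + 0.054 \cdot \text{Health} - 0.478 \cdot \text{Freedom} + 0.388 \cdot \text{Corruption} \\ & - 0.396 \cdot \text{Generosity} - 0.384 \cdot \text{Positive} + 0.108 \cdot \text{Negative} - 0.467 \cdot \text{Government} + 0.054 \cdot \text{Alcohol} \\ & - 0.078 \cdot \text{Population} + 0.246 \cdot \text{Tobacco} + 0.103 \cdot \text{Internet} \end{aligned}\]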
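The explained-variance shares and the coefficients (loadings) reported above are the kind of output that base R's `prcomp()` provides. The following sketch shows where these numbers come from, using simulated data. The data frame `toy` and its columns are invented for the example and do not reproduce the report's results.

```r
set.seed(1)
n <- 100
latent <- rnorm(n)
# Toy factors: three driven by one shared latent variable, one independent.
toy <- data.frame(
  A = latent + rnorm(n, sd = 0.3),
  B = latent + rnorm(n, sd = 0.3),
  C = -latent + rnorm(n, sd = 0.3),
  D = rnorm(n)
)

# Centre and scale the columns, then compute the principal components.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Share of the total variance explained by each component.
explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(explained, 3))

# Loadings: the coefficients of the linear combination that defines PC1.
print(round(pca$rotation[, "PC1"], 3))

# Scores: the coordinates of each observation on the components,
# i.e. the values that a PCA plot of PC1 against PC2 displays.
scores <- pca$x
```

The sign of a component is arbitrary, so a PC1 with mostly negative coefficients carries the same information as its mirror image with the signs flipped.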
### What further influences happiness?
Geographic map (each country colored based on its percentage change over time, 2015-2022)